Collect Delta extended statistics when creating table by pajaks · Pull Request #15878 · trinodb/trino

pajaks · 2023-01-27T12:29:24Z

Description

Collect Delta Lake extended statistics for CREATE TABLE AS.

Additional context and related issues

Release notes

(x) Release notes are required, with the following suggested text:

# Delta Lake
* Collect statistics for CREATE TABLE AS

pajaks · 2023-01-30T13:17:07Z

Rebased on master to pick up CI fix #15879.

alexjo2144

Looks pretty good overall. A couple of questions/nitpicks.

alexjo2144 · 2023-01-30T15:46:35Z

Maybe just collectExtendedColumnStatisticsOnWrite?

alexjo2144 · 2023-01-30T15:50:34Z

Save the result of extractColumnMetadata so that you don't have to call it again at the bottom of this method.

Suggested change

-        Set<String> allColumnNames = extractColumnMetadata(metadata, typeManager).stream()
-                .map(ColumnMetadata::getName)
-                .collect(toImmutableSet());
+        List<ColumnMetadata> columnMetadata = extractColumnMetadata(metadata, typeManager);
+        Set<String> allColumnNames = columnMetadata.stream()
+                .map(ColumnMetadata::getName)
+                .collect(toImmutableSet());

alexjo2144 · 2023-01-30T15:50:57Z

Per other comment, don't have to call extractColumnMetadata again.

alexjo2144 · 2023-01-30T15:55:32Z

Why not includeMaxFileModifiedTime in this situation?

Statistics aggregation during table creation does not have information about file_modified_time yet.

Right, right. Then if the modified time isn't present we just use the current time when the collection is done. Makes sense.

Can you please add a code comment explaining this consideration?
What would we need in order to have this information available?
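
The fallback discussed in this thread (use the maximum file modified time when it was collected, otherwise the time at which statistics collection finished) could be sketched roughly as follows. The class and method names are hypothetical and are not taken from the connector; this is only an illustration of the idea under that assumption.

```java
import java.time.Instant;
import java.util.Optional;

final class AnalyzedTimeFallback {
    private AnalyzedTimeFallback() {}

    // Hypothetical helper, not the connector's actual method: prefer the
    // collected maximum file modified time, otherwise fall back to the
    // moment statistics collection finished.
    static Instant resolveAnalyzedTime(Optional<Instant> maxFileModifiedTime, Instant collectionFinishedAt) {
        return maxFileModifiedTime.orElse(collectionFinishedAt);
    }
}
```

For the CREATE TABLE AS case described above, a caller would pass `Optional.empty()` together with `Instant.now()`, since no file modified time is collected during table creation.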

alexjo2144 · 2023-01-30T15:58:05Z

We should still test the old thing too

findinpath · 2023-01-30T17:15:00Z

Please create a compatibility test with Spark to verify that, after a CTAS, DESC EXTENDED works as intended on Databricks.

Never mind. Trino Delta Lake (on the storage layer) and Databricks (in the metastore properties) write their outputs to different places.

findinpath · 2023-01-30T17:15:25Z

Do consider documenting this new property in delta-lake.rst - either in this PR or a follow-up PR

I would hold off on documentation until the other write operations are implemented, if that's OK.

alexjo2144

👍

ebyhr · 2023-02-01T06:09:49Z

nit: There's no need to change this line. I would revert.

Reduce map iterations and lookups to minimum, while also simplifying the code flow.

findinpath · 2023-02-01T14:09:03Z

nit: cosmetic changes like this can be done in a separate commit.

findinpath · 2023-02-01T14:12:39Z

separate commit

This change makes sense only with this commit, as it allows the collection to have 0 elements. Previously, it was expected to throw an exception.

findinpath · 2023-02-01T14:15:14Z

Never mind. Trino Delta Lake (on the storage layer) and Databricks (in the metastore properties) write their outputs to different places.

findinpath · 2023-02-01T14:31:19Z

Can you please add a code comment explaining this consideration?
What would we need in order to have this information available?

pajaks · 2023-02-02T14:44:05Z

CI #12535, #15809

findinpath · 2023-02-03T08:56:00Z

Time is added during statistics update.

Do you mean the maximum file modified time?

findepi · 2023-02-03T15:43:50Z

updateTableStatistics( session,

findepi · 2023-02-03T15:46:35Z

This line is now over the line length limit, so we either put all arguments on one line or each on a separate line.

findepi · 2023-02-03T15:50:20Z

That's a minimal change, but it's not how you'd write the code if you were writing it anew.

.flatMap(entry -> {
    ....
    if (....) {
        return Stream.of();
    }
    return Stream.of(Instant.ofEpochMilli(....));
})
.collect(toOptional());
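
A self-contained sketch of the shape suggested above, for readers who want to try it outside the connector. Everything here is an assumption made for illustration: the statistic key, the `Map<String, Long>` input, and the class name are hypothetical, and the `reduce` call is a standard-library stand-in for the `toOptional()` collector so the example needs no extra dependencies.

```java
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

final class MaxModifiedTimeSketch {
    private MaxModifiedTimeSketch() {}

    // Hypothetical statistic key, used only for this illustration.
    static final String MAX_FILE_MODIFIED_TIME = "max_file_modified_time";

    // Returns the modified time if one was collected. A missing key or a
    // null value under the key is treated as "not collected".
    static Optional<Instant> extractMaxFileModifiedTime(Map<String, Long> computedStatistics) {
        return computedStatistics.entrySet().stream()
                .filter(entry -> entry.getKey().equals(MAX_FILE_MODIFIED_TIME))
                .flatMap(entry -> {
                    if (entry.getValue() == null) {
                        return Stream.<Instant>empty();
                    }
                    return Stream.of(Instant.ofEpochMilli(entry.getValue()));
                })
                // Stand-in for toOptional(): at most one element is expected,
                // so combining two elements is treated as an error.
                .reduce((first, second) -> {
                    throw new IllegalStateException("Expected at most one element");
                });
    }
}
```

The point of the pattern is that the absent case is expressed as an empty stream inside `flatMap`, so the final step yields an empty `Optional` without a separate null check at the call site.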

findepi · 2023-02-03T15:53:31Z

It sounds like a problem and a workaround, but there isn't a problem

// File modified time does not need to be collected as a statistics because it gets derived directly from files being written
false);

findepi · 2023-02-03T15:55:49Z

test_analyze_ -> test_ctats_stats_

findepi · 2023-02-03T15:57:08Z

Can you paste this method's contents into testCreateTableAsStatistics above?

testCreateTableAsStatistics has a good name and a javadoc; only its contents are worse.

findepi · 2023-02-03T16:01:57Z

nit: unrelated fmt change

findepi · 2023-02-03T16:02:01Z

nit: unrelated fmt change

findepi · 2023-02-03T16:02:24Z

nit: each arg on separate line

findepi · 2023-02-03T16:04:21Z

I know it's preexisting, but I don't think we need to assert the split count in every test method here. It blurs the test's intent.
(Perhaps we don't need it in any test, I don't know, but I am not requesting any change to existing tests.)

this would be better:

assertUpdate("ANALYZE " + tableName);

findepi · 2023-02-03T16:06:03Z

@pajaks @findinpath @alexjo2144 thank you, this is awesome!

In particular this improves Delta query performance on data sets created in the connector using CTAS.

pajaks · 2023-02-07T10:42:26Z

CI:
suite-7-non-generic #14441
suite-iceberg #16013

cla-bot Bot added the cla-signed label Jan 27, 2023

github-actions Bot added the tests:hive label Jan 27, 2023

pajaks marked this pull request as ready for review January 30, 2023 12:45

pajaks requested review from alexjo2144, ebyhr, findepi and findinpath January 30, 2023 12:45

alexjo2144 reviewed Jan 30, 2023

findinpath reviewed Jan 30, 2023

alexjo2144 approved these changes Jan 31, 2023

ebyhr approved these changes Feb 1, 2023

findepi added 3 commits February 1, 2023 08:35

Fix typo

d7976ba

Remove unnecessary variable

0b55cb6

Simplify map merging in Delta finishStatisticsCollection

6b0b80b

Reduce map iterations and lookups to minimum, while also simplifying the code flow.

findinpath reviewed Feb 1, 2023

Extract statistics update to separate method

04165ef

findinpath reviewed Feb 3, 2023

findepi approved these changes Feb 3, 2023

alexjo2144 mentioned this pull request Feb 3, 2023

Allow forcing Delta Lake analyze to ignore previous analysis time #15968

Closed

findepi and others added 2 commits February 6, 2023 10:41

Collect Delta extended statistics during table creation

2b0ae97

In particular this improves Delta query performance on data sets created in the connector using CTAS.

empty

22a27e6

findepi approved these changes Feb 6, 2023

empty

97e5fe8

findepi merged commit 6ed0ad5 into trinodb:master Feb 7, 2023

findepi mentioned this pull request Feb 7, 2023

Release notes for 407 #15854

Closed

github-actions Bot added this to the 407 milestone Feb 7, 2023

pajaks deleted the findepi/delta-analyze-on-write branch February 8, 2023 08:14

colebow mentioned this pull request Feb 10, 2023

Add Trino 407 release notes #15919

Merged

findepi mentioned this pull request Mar 29, 2023

Collect Delta extended statistics during writes #14575

Closed
